Abstract

We study the distribution of the adaptive LASSO estimator (Zou (2006))
in finite samples as well as in the large-sample limit. The large-sample
distributions are derived both for the case where the adaptive
LASSO estimator is tuned to perform conservative model selection as 
well as for the case where the tuning results in consistent model selection. 
We show that the finite-sample as well as the large-sample distributions 
are typically highly non-normal, regardless of the choice of the tuning
parameter. The uniform convergence rate is also obtained, and is shown to
be slower than $n^{-1/2}$ in case the estimator is tuned to perform consistent
model selection. In particular, these results question the statistical
relevance of the 'oracle' property of the adaptive LASSO estimator established
in Zou (2006). Moreover, we also provide an impossibility result regarding
the estimation of the distribution function of the adaptive LASSO esti- 
mator. The theoretical results, which are obtained for a regression model 
with orthogonal design, are complemented by a Monte Carlo study using 
non-orthogonal regressors. 

MSC 2000 subject classification. Primary 62F11, 62F12, 62E15, 62J05,
62J07. 

Key words and phrases. Penalized maximum likelihood, LASSO, adap- 
tive LASSO, nonnegative garotte, finite-sample distribution, asymptotic 
distribution, oracle property, estimation of distribution, uniform consis- 
tency. 



1 Introduction 



Penalized maximum likelihood estimators have been studied intensively in the 
last few years. A prominent example is the least absolute selection and
shrinkage (LASSO) estimator of Tibshirani (1996). Related variants of the LASSO
include the Bridge estimators studied by Frank & Friedman (1993), least angle
regression (LARS) of Efron et al. (2004), or the smoothly clipped absolute
deviation (SCAD) estimator of Fan & Li (2001). Other estimators that fit into this
framework are hard- and soft-thresholding estimators. While many properties
of penalized maximum likelihood estimators are now well understood, the
understanding of their distributional properties, such as finite-sample and
large-sample limit distributions, is still incomplete. Probably the most important
contribution in this respect is Knight & Fu (2000), who study the asymptotic
distribution of the LASSO estimator (and of Bridge estimators more generally)
when the tuning parameter governing the influence of the penalty term is chosen
in such a way that the LASSO acts as a conservative model selection procedure
(that is, a procedure that does not select underparameterized models
asymptotically, but selects overparameterized models with positive probability
asymptotically). In Knight & Fu (2000), the asymptotic distribution is obtained in a
fixed-parameter as well as in a standard local alternatives setup. This is
complemented by a result in Zou (2006), who considers the fixed-parameter asymptotic
distribution of the LASSO when tuned to act as a consistent model selection
procedure. Another contribution is Fan & Li (2001), who derive the fixed-parameter
asymptotic distribution of the SCAD estimator when the tuning parameter is
chosen in such a way that the SCAD estimator performs consistent model
selection; in particular, they establish the so-called 'oracle' property for this
estimator. Zou (2006) introduced a variant of the LASSO, the so-called adaptive
LASSO estimator, and established the 'oracle' property for this estimator when
suitably tuned. Since it is well-known that fixed-parameter (i.e., pointwise)
asymptotic results can give a wrong picture of the estimator's actual
behavior, especially when the estimator performs model selection (see, e.g., Kabaila,
or Leeb & Pötscher (2005), Pötscher & Leeb (2007), Leeb & Pötscher),
it is important to take a closer look at the actual distributional
properties of the adaptive LASSO estimator.

In the present paper we study the finite-sample as well as the large-sample 
distribution of the adaptive LASSO estimator in a linear regression model. In 
particular, we study both the case where the estimator is tuned to perform con- 
servative model selection as well as the case where it is tuned to perform con- 
sistent model selection. We find that the finite-sample distributions are highly 
non-normal (e.g., are often multimodal) and that a standard fixed-parameter
asymptotic analysis gives a highly misleading impression of the finite-sample
behavior. In particular, the 'oracle' property, which is based on a fixed-parameter
asymptotic analysis, is shown not to provide a reliable assessment of the
estimators' actual performance. For these reasons, we also obtain the large-sample
distributions of the above-mentioned estimators under a general "moving
parameter" asymptotic framework, which much better captures the actual behavior
of the estimator. [Interestingly, it turns out that in case the estimator is tuned
to perform consistent model selection a "moving parameter" asymptotic
framework more general than the usual $n^{-1/2}$-local asymptotic framework is necessary
to exhibit the full range of possible limiting distributions.] Furthermore, we
obtain the uniform convergence rate of the adaptive LASSO estimator and show
that it is slower than $n^{-1/2}$ in the case where the estimator is tuned to
perform consistent model selection. This again exposes the misleading character
of the 'oracle' property. We also show that the finite-sample distribution of the
adaptive LASSO estimator cannot be estimated in any reasonable sense,
complementing results of this sort in the literature such as Leeb & Pötscher (2006a,b),
Pötscher & Leeb (2007), Leeb & Pötscher (2008a) and Pötscher (2006).

Apart from the papers already mentioned, there has been a recent surge of
publications establishing the 'oracle' property for a variety of penalized
maximum likelihood or related estimators (e.g., Bunea (2004), Bunea & McKeague
(2005), Fan & Li (2002, 2004), Li & Liang (2007), Wang & Leng (2007),
Wang, G. Li and Jiang (2007), Wang, G. Li and Tsai (2007), Wang, R. Li and
Tsai (2007), Yuan & Lin (2007), Zhang & Lu (2007), Zou & Yuan (2008),
Zou & Li (2008), Johnson et al. (2008)).
The 'oracle' property also paints a misleading picture of the behavior of the
estimators considered in these papers; see the discussion in Leeb & Pötscher
(2005), Yang (2005), Pötscher (2007), Pötscher & Leeb (2007), Leeb & Pötscher
(2008b).

The paper is organized as follows. The model and the adaptive LASSO
estimator are introduced in Section 2. In Section 3 we study the estimator
theoretically in an orthogonal linear regression model. In particular, the model
selection probabilities implied by the adaptive LASSO estimator are discussed
in Section 3.1. Consistency, uniform consistency, and uniform convergence rates
of the estimator are the subject of Section 3.2. The finite-sample distributions
are derived in Section 3.3.1, whereas the asymptotic distributions are studied
in Section 3.3.2. We provide an impossibility result regarding the estimation of
the adaptive LASSO's distribution function in Section 4. Section 5 studies the
behavior of the adaptive LASSO estimator by Monte Carlo without imposing
the simplifying assumption of orthogonal regressors. We finally summarize our
findings in Section 6. Proofs and some technical details are deferred to an
appendix.



2 The Adaptive LASSO Estimator 

We consider the linear regression model 

$$Y = X\theta + u \tag{1}$$

where $X$ is a nonstochastic $n \times k$ matrix of rank $k$ and $u$ is multivariate normal
with mean zero and variance-covariance matrix $\sigma^2 I_n$. Let $\hat\theta_{LS} = (X'X)^{-1}X'Y$
denote the least squares (maximum likelihood) estimator. The adaptive LASSO
estimator $\hat\theta_A$ is defined as the solution to the minimization problem

$$(Y - X\theta)'(Y - X\theta) + 2n\mu_n^2 \sum_{i=1}^{k} |\theta_i| / |\hat\theta_{LS,i}| \tag{2}$$

where the tuning parameter $\mu_n$ is a positive real number. As long as $\hat\theta_{LS,i} \neq 0$
for every $i$, the function given by (2) is well-defined and strictly convex and
hence has a uniquely defined minimizer $\hat\theta_A$. [The event where $\hat\theta_{LS,i} = 0$ for
some $i$ has probability zero under the probability measure governing $u$. Hence,
it is inconsequential how we define $\hat\theta_A$ on this event; for reasons of convenience,
we shall adopt the convention that $\mu_n^2/|\hat\theta_{LS,i}| = 0$ if $\hat\theta_{LS,i} = 0$. Furthermore,
$\hat\theta_A$ is a measurable function of $Y$.] Note that Zou (2006) uses $\lambda_n = 2n\mu_n^2$
as the tuning parameter. Zou (2006) also considers versions of the adaptive
LASSO estimator for which $|\hat\theta_{LS,i}|$ in (2) is replaced by $|\hat\theta_{LS,i}|^{\gamma}$. However, we
shall exclusively concentrate on the leading case $\gamma = 1$. As pointed out in
Zou (2006), the adaptive LASSO is closely related to the nonnegative Garotte
estimator of Breiman (1995).



3 Theoretical Analysis 

For the theoretical analysis in this section we shall make some simplifying
assumptions. First, we assume that $\sigma^2$ is known, whence we may assume without
loss of generality that $\sigma^2 = 1$. Second, we assume orthogonal regressors, i.e.,
$X'X$ is diagonal. The latter assumption will be removed in the Monte Carlo
study in Section 5. Orthogonal regressors occur in many important settings,
including wavelet regression or the analysis of variance. More specifically, we
shall assume $X'X = nI_k$. In this case the minimization of (2) is equivalent to
separately minimizing

$$n(\hat\theta_{LS,i} - \theta_i)^2 + 2n\mu_n^2 |\theta_i| / |\hat\theta_{LS,i}| \tag{3}$$

for $i = 1, \ldots, k$. Since the estimators $\hat\theta_{LS,i}$ are independent, so are the
components of $\hat\theta_A$, provided $\mu_n$ is nonrandom, which we shall assume for the theoretical
analysis throughout this section. To study the joint distribution of $\hat\theta_A$, it hence
suffices to study the distribution of the individual components. Hence, we may
assume without loss of generality that $\theta$ is scalar, i.e., $k = 1$, for the rest of
this section. In fact, as is easily seen, there is then no loss of generality to
even assume that $X$ is just a column of 1's, i.e., we may then consider a simple
Gaussian location problem where $\hat\theta_{LS} = \bar y$, the arithmetic mean of the
independent and identically $N(\theta, 1)$-distributed observations $y_1, \ldots, y_n$. Under these
assumptions, the minimization problem defining the adaptive LASSO has an
explicit solution of the form

$$\hat\theta_A = \begin{cases} 0 & \text{if } |\bar y| \le \mu_n \\ \bar y - \mu_n^2/\bar y & \text{if } |\bar y| > \mu_n. \end{cases} \tag{4}$$

The explicit formula (4) also shows that in the location model (and, more
generally, in the diagonal regression model) the adaptive LASSO and the nonnegative
Garotte coincide, and thus the results in the present section also apply to the
latter estimator. In view of (4) we also note that in the diagonal regression
model the adaptive LASSO is nothing else than a positive-part Stein estimator
applied componentwise. Of course, this is not in the spirit of Stein estimation.
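
As a numerical sanity check of the closed form (4), the following Python sketch compares it with a brute-force grid minimization of the componentwise criterion (3) in the scalar case $k = 1$. It is an illustration only: the function names `adaptive_lasso_location`, `objective` and `grid_minimizer`, as well as the grid bounds and resolution, are assumptions made for this example and do not come from the paper.

```python
import math


def adaptive_lasso_location(ybar, mu):
    """Closed-form adaptive LASSO in the Gaussian location model, formula (4)."""
    if abs(ybar) <= mu:
        return 0.0
    return ybar - mu**2 / ybar


def objective(theta, ybar, mu, n):
    """Criterion (3) for k = 1: n*(ybar - theta)^2 + 2*n*mu^2*|theta|/|ybar|."""
    return n * (ybar - theta) ** 2 + 2 * n * mu**2 * abs(theta) / abs(ybar)


def grid_minimizer(ybar, mu, n, lo=-3.0, hi=3.0, steps=60001):
    """Brute-force minimizer of (3) over an evenly spaced grid (illustration only)."""
    best_theta, best_val = lo, math.inf
    for j in range(steps):
        theta = lo + (hi - lo) * j / (steps - 1)
        val = objective(theta, ybar, mu, n)
        if val < best_val:
            best_theta, best_val = theta, val
    return best_theta


if __name__ == "__main__":
    n, mu = 10, 0.05  # hypothetical sample size and tuning parameter
    for ybar in (0.03, 0.2, -0.5, 1.0):
        closed = adaptive_lasso_location(ybar, mu)
        numeric = grid_minimizer(ybar, mu, n)
        print(f"ybar={ybar:+.2f}  closed form={closed:+.4f}  grid={numeric:+.4f}")
```

The grid search is deliberately naive; since (3) is strictly convex in $\theta_i$, any one-dimensional convex solver would serve equally well for the comparison.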
3.1 Model selection probabilities and tuning parameter



The adaptive LASSO estimator $\hat\theta_A$ can be viewed as performing a selection
between the restricted model $M_R$ consisting only of the $N(0,1)$-distribution
and the unrestricted model $M_U = \{N(\theta, 1) : \theta \in \mathbb{R}\}$ in an obvious way, i.e., $M_R$
is selected if $\hat\theta_A = 0$ and $M_U$ is selected otherwise. We now study the model
selection probabilities, i.e., the probabilities that model $M_U$ or $M_R$, respectively,
is selected. As these selection probabilities add up to one, it suffices to consider
one of them. The probability of selecting the restricted model $M_R$ is given by

$$P_{n,\theta}(\hat\theta_A = 0) = P_{n,\theta}(|\bar y| \le \mu_n) = \Pr(|Z + n^{1/2}\theta| \le n^{1/2}\mu_n) = \Phi(-n^{1/2}\theta + n^{1/2}\mu_n) - \Phi(-n^{1/2}\theta - n^{1/2}\mu_n), \tag{5}$$

where $Z$ is a standard normal random variable with cumulative distribution
function (cdf) $\Phi$. We use $P_{n,\theta}$ to denote the probability governing a sample
of size $n$ when $\theta$ is the true parameter, and Pr to denote a generic probability
measure.

In the following we shall always impose the condition that $\mu_n \to 0$ for
asymptotic considerations, which guarantees that the probability of incorrectly
selecting the restricted model $M_R$ (i.e., selecting $M_R$ if the true $\theta$ is non-zero) vanishes
asymptotically. Conversely, if this probability vanishes asymptotically for every
$\theta \neq 0$, then $\mu_n \to 0$ follows, hence the condition $\mu_n \to 0$ is a basic one and
without it the estimator $\hat\theta_A$ does not seem to be of much interest.

Given the condition that $\mu_n \to 0$, two cases need to be distinguished: (i)
$n^{1/2}\mu_n \to m$, $0 \le m < \infty$ and (ii) $n^{1/2}\mu_n \to \infty$.^1 In case (i), the adaptive
LASSO estimator acts as a conservative model selection procedure, meaning
that the probability of selecting the larger model $M_U$ has a positive limit even
when $\theta = 0$, whereas in case (ii), $\hat\theta_A$ acts as a consistent model selection
procedure, i.e., this probability vanishes in the limit when $\theta = 0$. This is immediately
seen by inspection of (5). In different guise, these facts have long been known,
see Bauer et al. (1988). In his analysis of the adaptive LASSO estimator Zou (2006)
assumes $n^{1/4}\mu_n \to 0$ and $n^{1/2}\mu_n \to \infty$, hence he considers a subcase
of case (ii). We shall discuss the reason why Zou (2006) imposes the stricter
condition $n^{1/4}\mu_n \to 0$ in Section 3.3.2.

The asymptotic behavior of the model selection probabilities discussed in the
preceding paragraph is of a "pointwise" asymptotic nature in the sense that the
value of $\theta$ is held fixed when $n \to \infty$. Since pointwise asymptotic results often
miss essential aspects of the finite-sample behavior, we next present a "moving
parameter" asymptotic analysis, i.e., we allow $\theta$ to vary with $n$ in the asymptotic
analysis, which better reveals the features of the problem in finite samples. Note
that the following proposition in particular shows that the convergence of the
model selection probability to its limit in a pointwise asymptotic analysis is not
uniform in $\theta \in \mathbb{R}$ (in fact, it fails to be uniform in any neighborhood of $\theta = 0$).



^1 There is no loss in generality here in the sense that the general case where only $\mu_n \to 0$
holds can always be reduced to case (i) or case (ii) by passing to subsequences.

Proposition 1 Assume $\mu_n \to 0$ and $n^{1/2}\mu_n \to m$ with $0 \le m \le \infty$.
(i) Assume $0 \le m < \infty$ (corresponding to conservative model selection).
Suppose that the true parameter $\theta_n \in \mathbb{R}$ satisfies $n^{1/2}\theta_n \to \nu \in \mathbb{R} \cup \{-\infty, \infty\}$.
Then

$$\lim_{n \to \infty} P_{n,\theta_n}(\hat\theta_A = 0) = \Phi(-\nu + m) - \Phi(-\nu - m).$$

(ii) Assume $m = \infty$ (corresponding to consistent model selection). Suppose
$\theta_n \in \mathbb{R}$ satisfies $\theta_n/\mu_n \to \zeta \in \mathbb{R} \cup \{-\infty, \infty\}$. Then

1. $|\zeta| < 1$ implies $\lim_{n \to \infty} P_{n,\theta_n}(\hat\theta_A = 0) = 1$,

2. $|\zeta| = 1$ and $n^{1/2}(\mu_n - \zeta\theta_n) \to r$ for some $r \in \mathbb{R} \cup \{-\infty, \infty\}$ implies
$\lim_{n \to \infty} P_{n,\theta_n}(\hat\theta_A = 0) = \Phi(r)$,

3. $|\zeta| > 1$ implies $\lim_{n \to \infty} P_{n,\theta_n}(\hat\theta_A = 0) = 0$.
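
The three regimes in part (ii) of Proposition 1 can be made concrete by evaluating the selection probability (5) along moving parameter sequences. The Python sketch below does this under an assumed tuning sequence $\mu_n = n^{-1/4}$ (so that $\mu_n \to 0$ and $n^{1/2}\mu_n \to \infty$); this choice, the values of $\zeta$ and $r$, and the helper names `Phi`, `prob_restricted` and `mu_consistent` are illustrative assumptions, not part of the paper.

```python
import math


def Phi(z):
    """Standard normal cdf, written with erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def prob_restricted(n, theta, mu):
    """Probability (5) that the adaptive LASSO estimate equals zero."""
    rn = math.sqrt(n)
    return Phi(-rn * theta + rn * mu) - Phi(-rn * theta - rn * mu)


def mu_consistent(n):
    """Assumed tuning sequence mu_n = n**(-1/4) (consistent model selection)."""
    return n ** -0.25


if __name__ == "__main__":
    r = 0.5  # assumed limit of sqrt(n)*(mu_n - theta_n) in the boundary case
    print("n, P(zeta=0.5), P(zeta=2), P(zeta=1 boundary), Phi(r)")
    for n in (1e2, 1e4, 1e6, 1e8):
        mu = mu_consistent(n)
        p_small = prob_restricted(n, 0.5 * mu, mu)   # |zeta| < 1: limit 1
        p_large = prob_restricted(n, 2.0 * mu, mu)   # |zeta| > 1: limit 0
        p_edge = prob_restricted(n, mu - r / math.sqrt(n), mu)  # zeta = 1
        print(int(n), round(p_small, 4), round(p_large, 4),
              round(p_edge, 4), round(Phi(r), 4))
```

Note that $\theta_n = 0.5\,\mu_n$ is of larger order than $n^{-1/2}$ here, yet the restricted model is still selected with probability approaching one.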

The proof of Proposition 1 is identical to the proof of Proposition 1 in
Pötscher & Leeb (2007) and hence is omitted. The above proposition in fact
completely describes the large-sample behavior of the model selection
probability without any conditions on the parameter $\theta$, in the sense that all possible
accumulation points of the model selection probability along arbitrary sequences
of $\theta_n$ can be obtained in the following manner: Apply the result to subsequences
and observe that, by compactness of $\mathbb{R} \cup \{-\infty, \infty\}$, we can select from every
subsequence a further subsequence such that all relevant quantities such as $n^{1/2}\theta_n$,
$\theta_n/\mu_n$, $n^{1/2}(\mu_n - \theta_n)$, or $n^{1/2}(\mu_n + \theta_n)$ converge in $\mathbb{R} \cup \{-\infty, \infty\}$ along this
further subsequence.

In the case of conservative model selection, Proposition 1 shows that the
usual local alternative parameter sequences describe the asymptotic behavior.
In particular, if $\theta_n$ is local to $\theta = 0$ in the sense that $\theta_n = \nu/n^{1/2}$, the local
alternatives parameter $\nu$ governs the limiting model selection probability.
Deviations of $\theta_n$ from $\theta = 0$ of order $1/n^{1/2}$ are detected with positive probability
asymptotically and deviations of larger order are detected with probability one
asymptotically in this case. In the consistent model selection case, however, a
different picture emerges. Here, Proposition 1 shows that local deviations of $\theta_n$
from $\theta = 0$ that are of the order $1/n^{1/2}$ are not detected by the model selection
procedure at all.^2 In fact, even larger deviations from zero go asymptotically
unnoticed by the model selection procedure, namely as long as $\theta_n/\mu_n \to \zeta$,
$|\zeta| < 1$. [Note that these larger deviations would be picked up by a conservative
procedure with probability one asymptotically.] This unpleasant consequence of
model selection consistency has a number of repercussions as we shall see later
on. For a more detailed discussion of these facts in the context of
post-model-selection estimators see Leeb & Pötscher (2005).



The speed of convergence of the model selection probability to its limit in
part (i) of the proposition is governed by the slower of the convergence speeds
of $n^{1/2}\mu_n$ and $n^{1/2}\theta_n$. In part (ii), it is exponential in $n^{1/2}\mu_n$ in cases 1 and 3,
and is governed by the convergence speed of $n^{1/2}\mu_n$ and $n^{1/2}(\mu_n - \zeta\theta_n)$ in case 2.

^2 For such deviations this also immediately follows from a contiguity argument.

3.2 Uniform consistency and uniform convergence rate of 
the adaptive LASSO estimator 

It is easy to see that the natural condition $\mu_n \to 0$ discussed in the preceding
section is in fact equivalent to consistency of $\hat\theta_A$ for $\theta$. Moreover, under this
basic condition the estimator is even uniformly consistent with a certain rate as
we show next.

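Theorem 2 Assume that $\mu_n \to 0$. Then $\hat\theta_A$ is uniformly consistent for $\theta$, i.e.,

$$\lim_{n \to \infty} \sup_{\theta \in \mathbb{R}} P_{n,\theta}(|\hat\theta_A - \theta| > \varepsilon) = 0 \tag{6}$$

for every $\varepsilon > 0$. Furthermore, let $a_n = \min(n^{1/2}, \mu_n^{-1})$. Then, for every $\varepsilon > 0$,
there exists a (nonnegative) real number $M$ such that

$$\sup_{n \in \mathbb{N}} \sup_{\theta \in \mathbb{R}} P_{n,\theta}(a_n |\hat\theta_A - \theta| > M) \le \varepsilon \tag{7}$$

holds. In particular, $\hat\theta_A$ is uniformly $a_n$-consistent.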

For the case where the estimator $\hat\theta_A$ is tuned to perform conservative model
selection, the preceding theorem shows that these estimators are uniformly
$n^{1/2}$-consistent. In contrast, in case the estimators are tuned to perform consistent
model selection, the theorem only guarantees uniform $\mu_n^{-1}$-consistency; that the
estimator does actually not converge faster than $\mu_n$ in a uniform sense will be
shown in Section 3.3.2.

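Remark 3 In case $n^{1/2}\mu_n \to m$ with $m = 0$, the adaptive LASSO estimator
is uniformly asymptotically equivalent to the unrestricted maximum likelihood
estimator $\bar y$ in the sense that $\sup_{\theta \in \mathbb{R}} P_{n,\theta}(n^{1/2}|\hat\theta_A - \bar y| > \varepsilon) \to 0$ for $n \to \infty$
and for every $\varepsilon > 0$. Using (4) this follows easily from

$$P_{n,\theta}(n^{1/2}|\hat\theta_A - \bar y| > \varepsilon) \le 1(n^{1/2}\mu_n > \varepsilon) + P_{n,\theta}(n^{1/2}\mu_n^2/|\bar y| > \varepsilon, |\bar y| > \mu_n) \le 2 \cdot 1(n^{1/2}\mu_n > \varepsilon) \to 0.$$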

3.3 The distribution of the adaptive LASSO 
3.3.1 Finite-sample distributions 

We now derive the finite-sample distribution of $n^{1/2}(\hat\theta_A - \theta)$. For purposes of
comparison we note the obvious fact that the distribution of the unrestricted
maximum likelihood estimator $\hat\theta_U = \bar y$ (corresponding to model $M_U$) as well as
the distribution of the restricted maximum likelihood estimator $\hat\theta_R \equiv 0$
(corresponding to model $M_R$) are normal. More precisely, $n^{1/2}(\hat\theta_U - \theta)$ is
$N(0, 1)$-distributed and $n^{1/2}(\hat\theta_R - \theta)$ is $N(-n^{1/2}\theta, 0)$-distributed, where the singular
normal distribution is to be interpreted as pointmass at $-n^{1/2}\theta$. [The latter is
simply an instance of the fact that in case $k > 1$ the restricted estimator has a
singular normal distribution concentrated on the subspace defined by the zero
restrictions.]

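The finite-sample distribution $F_{A,n,\theta}$ of $n^{1/2}(\hat\theta_A - \theta)$ is given by

$$\begin{aligned}
P_{n,\theta}(n^{1/2}(\hat\theta_A - \theta) \le x) &= P_{n,\theta}(n^{1/2}(\hat\theta_A - \theta) \le x, \hat\theta_A = 0) \\
&\quad + P_{n,\theta}(n^{1/2}(\hat\theta_A - \theta) \le x, \hat\theta_A > 0) \\
&\quad + P_{n,\theta}(n^{1/2}(\hat\theta_A - \theta) \le x, \hat\theta_A < 0) \\
&= A + B + C.
\end{aligned}$$

By (5) we clearly have

$$A = 1(-n^{1/2}\theta \le x) \{\Phi(-n^{1/2}\theta + n^{1/2}\mu_n) - \Phi(-n^{1/2}\theta - n^{1/2}\mu_n)\}.$$

Furthermore, using expression (4) we find that

$$\begin{aligned}
B &= P_{n,\theta}(n^{1/2}(\bar y - \mu_n^2/\bar y - \theta) \le x, \bar y > \mu_n) \\
&= P_{n,\theta}(n^{1/2}(\bar y^2 - \mu_n^2 - \theta \bar y) \le \bar y x, \bar y > \mu_n) \\
&= \Pr(Z^2 + n^{1/2}\theta Z - n\mu_n^2 \le Zx + n^{1/2}\theta x, Z > -n^{1/2}\theta + n^{1/2}\mu_n) \\
&= \Pr(Z^2 + (n^{1/2}\theta - x)Z - (n\mu_n^2 + n^{1/2}\theta x) \le 0, Z > -n^{1/2}\theta + n^{1/2}\mu_n),
\end{aligned}$$

where $Z$ follows a standard normal distribution. The quadratic form in $Z$ is
convex and hence is less than or equal to zero precisely between the zeroes of
the equation

$$z^2 + (n^{1/2}\theta - x)z - (n\mu_n^2 + n^{1/2}\theta x) = 0.$$

The solutions $z^{(1)}_{n,\theta}(x)$ and $z^{(2)}_{n,\theta}(x)$ of this equation with $z^{(1)}_{n,\theta}(x) \le z^{(2)}_{n,\theta}(x)$ are
given by

$$-(n^{1/2}\theta - x)/2 \pm \sqrt{((n^{1/2}\theta + x)/2)^2 + n\mu_n^2}. \tag{8}$$

Note that the expression under the root in (8) is always positive, so that

$$B = \Pr(z^{(1)}_{n,\theta}(x) \le Z \le z^{(2)}_{n,\theta}(x), Z > -n^{1/2}\theta + n^{1/2}\mu_n).$$

Observe that $z^{(1)}_{n,\theta}(x) \le -n^{1/2}\theta + n^{1/2}\mu_n$ always holds and that
$-n^{1/2}\theta + n^{1/2}\mu_n \le z^{(2)}_{n,\theta}(x)$ is equivalent to $n^{1/2}\theta + x \ge 0$, so that we can write

$$B = 1(n^{1/2}\theta + x \ge 0) \{\Phi(z^{(2)}_{n,\theta}(x)) - \Phi(-n^{1/2}\theta + n^{1/2}\mu_n)\}.$$

The term $C$ can be treated in a similar fashion to arrive at

$$C = 1(n^{1/2}\theta + x \ge 0)\, \Phi(-n^{1/2}\theta - n^{1/2}\mu_n) + 1(n^{1/2}\theta + x < 0)\, \Phi(z^{(1)}_{n,\theta}(x)).$$

Adding up $A$, $B$ and $C$, we now obtain the finite-sample distribution function
of $n^{1/2}(\hat\theta_A - \theta)$ as

$$F_{A,n,\theta}(x) = 1(n^{1/2}\theta + x \ge 0)\, \Phi(z^{(2)}_{n,\theta}(x)) + 1(n^{1/2}\theta + x < 0)\, \Phi(z^{(1)}_{n,\theta}(x)). \tag{9}$$

It follows that the distribution of $n^{1/2}(\hat\theta_A - \theta)$ consists of an atomic part given
by

$$\{\Phi(n^{1/2}(-\theta + \mu_n)) - \Phi(n^{1/2}(-\theta - \mu_n))\}\, \delta_{-n^{1/2}\theta}, \tag{10}$$

where $\delta_z$ represents pointmass at the point $z$, and an absolutely continuous part
that has a Lebesgue density given by

$$0.5 \times \{1(n^{1/2}\theta + x > 0)\, \phi(z^{(2)}_{n,\theta}(x)) (1 + t_{n,\theta}(x)) + 1(n^{1/2}\theta + x < 0)\, \phi(z^{(1)}_{n,\theta}(x)) (1 - t_{n,\theta}(x))\}, \tag{11}$$

where $t_{n,\theta}(x) = 0.5 (n^{1/2}\theta + x) / (((n^{1/2}\theta + x)/2)^2 + n\mu_n^2)^{1/2}$. Figure 1
illustrates the shape of the finite-sample distribution of $n^{1/2}(\hat\theta_A - \theta)$. Obviously,
the distribution is highly non-normal.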
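
The finite-sample formulas lend themselves to a direct numerical check. The following Python sketch evaluates the cdf (9) via the zeros (8), confirms that its jump at $-n^{1/2}\theta$ matches the atom (10), and compares (9) with a Monte Carlo estimate obtained by simulating $\bar y \sim N(\theta, 1/n)$ and applying (4). The parameter values mirror those of Figure 1; all function names, the number of replications and the seed are assumptions made for this illustration.

```python
import math
import random


def Phi(z):
    """Standard normal cdf, written with erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def zeros(x, n, theta, mu):
    """The two zeros z1 <= z2 given in (8)."""
    rn = math.sqrt(n)
    centre = -(rn * theta - x) / 2.0
    half_width = math.sqrt(((rn * theta + x) / 2.0) ** 2 + n * mu**2)
    return centre - half_width, centre + half_width


def cdf(x, n, theta, mu):
    """Finite-sample cdf (9) of sqrt(n)*(theta_hat_A - theta)."""
    z1, z2 = zeros(x, n, theta, mu)
    return Phi(z2) if math.sqrt(n) * theta + x >= 0 else Phi(z1)


def atom_mass(n, theta, mu):
    """Mass (10) of the atom located at -sqrt(n)*theta."""
    rn = math.sqrt(n)
    return Phi(rn * (-theta + mu)) - Phi(rn * (-theta - mu))


def simulated_cdf(x, n, theta, mu, reps=200_000, seed=0):
    """Monte Carlo estimate of the cdf, using ybar ~ N(theta, 1/n) and (4)."""
    rng = random.Random(seed)
    rn = math.sqrt(n)
    hits = 0
    for _ in range(reps):
        ybar = rng.gauss(theta, 1.0 / rn)
        est = 0.0 if abs(ybar) <= mu else ybar - mu**2 / ybar
        hits += rn * (est - theta) <= x
    return hits / reps


if __name__ == "__main__":
    n, theta, mu = 10, 0.1, 0.05  # the values used in Figure 1
    x0 = -math.sqrt(n) * theta
    jump = cdf(x0, n, theta, mu) - cdf(x0 - 1e-9, n, theta, mu)
    print("jump at atom:", round(jump, 4), " formula (10):", round(atom_mass(n, theta, mu), 4))
    for x in (-1.0, 0.0, 0.5):
        print(x, round(cdf(x, n, theta, mu), 4), round(simulated_cdf(x, n, theta, mu), 4))
```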



Figure 1: Distribution of $n^{1/2}(\hat\theta_A - \theta)$ for $n = 10$, $\theta = 0.1$, $\mu_n = 0.05$. The plot
shows the density of the absolutely continuous part (11), as well as the total
mass of the atomic part (10) located at $-n^{1/2}\theta = -0.32$.



3.3.2 Asymptotic distributions 

We next obtain the asymptotic distributions of $\hat\theta_A$ under general "moving
parameter" asymptotics (i.e., asymptotics where the true parameter can depend on
sample size), since, as already noted earlier, considering only fixed-parameter
asymptotics may paint a very misleading picture of the behavior of the
estimator. In fact, the results given below amount to a complete description of all
possible accumulation points of the finite-sample distribution, cf. Remark 7.
Not surprisingly, the results in the conservative model selection case are different
from the ones in the consistent model selection case.



Conservative case The large-sample behavior of the distribution $F_{A,n,\theta_n}$ of
$n^{1/2}(\hat\theta_A - \theta_n)$ for the case when the estimator is tuned to perform conservative
model selection is characterized in the following theorem.

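Theorem 4 Assume $\mu_n \to 0$ and $n^{1/2}\mu_n \to m$, $0 \le m < \infty$. Suppose the true
parameter $\theta_n \in \mathbb{R}$ satisfies $n^{1/2}\theta_n \to \nu \in \mathbb{R} \cup \{-\infty, \infty\}$. Then, for $\nu \in \mathbb{R}$,
$F_{A,n,\theta_n}$ converges weakly to the distribution

$$1(x + \nu \ge 0)\, \Phi\left(-\frac{\nu - x}{2} + \sqrt{\left(\frac{\nu + x}{2}\right)^2 + m^2}\right) + 1(x + \nu < 0)\, \Phi\left(-\frac{\nu - x}{2} - \sqrt{\left(\frac{\nu + x}{2}\right)^2 + m^2}\right).$$

If $|\nu| = \infty$, then $F_{A,n,\theta_n}$ converges weakly to $\Phi$, i.e., to a standard normal
distribution.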

The fixed-parameter asymptotic distribution can be obtained from
Theorem 4 by setting $\theta_n \equiv \theta$: For $\theta = 0$, we get

$$1(x \ge 0)\, \Phi(x/2 + \sqrt{(x/2)^2 + m^2}) + 1(x < 0)\, \Phi(x/2 - \sqrt{(x/2)^2 + m^2}),$$

which coincides with the finite-sample distribution in (9) except for replacing
$n^{1/2}\mu_n$ with its limit $m$. However, for $\theta \neq 0$, the resulting fixed-parameter
asymptotic distribution is a standard normal distribution which clearly
misrepresents the actual distribution (9). This disagreement is most pronounced in
the statistically interesting case where $\theta$ is close to, but not equal to, zero (e.g.,
$\theta \sim n^{-1/2}$). In contrast, the distribution given in Theorem 4 much better
captures the behavior of the finite-sample distribution also in this case because it
coincides with the finite-sample distribution (9) except for the fact that $n^{1/2}\mu_n$
and $n^{1/2}\theta_n$ have settled down to their limiting values.
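
The contrast between the pointwise and the moving-parameter picture can be quantified numerically. The Python sketch below writes the cdf (9) as a function of $a = n^{1/2}\theta$ and $b = n^{1/2}\mu_n$ only (so that the limit in Theorem 4 is the same function evaluated at $a = \nu$, $b = m$) and reports the largest distance to the standard normal cdf over a grid. The conservative tuning $\mu_n = n^{-1/2}$ (i.e., $m = 1$), the choices $\theta = 0.5$ and $\nu = 1$, the grid, and all function names are assumptions made for illustration.

```python
import math


def Phi(z):
    """Standard normal cdf, written with erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def cdf_scaled(x, a, b):
    """cdf (9) in terms of a = sqrt(n)*theta and b = sqrt(n)*mu_n.

    With a = nu and b = m this is the limit distribution of Theorem 4.
    """
    centre = -(a - x) / 2.0
    half_width = math.sqrt(((a + x) / 2.0) ** 2 + b * b)
    return Phi(centre + half_width) if a + x >= 0 else Phi(centre - half_width)


def sup_distance_to_normal(n, theta, mu):
    """Largest |F_{A,n,theta}(x) - Phi(x)| over a grid on [-5.005, 5.005]."""
    rn = math.sqrt(n)
    grid = (-5.005 + j / 100.0 for j in range(1002))
    return max(abs(cdf_scaled(x, rn * theta, rn * mu) - Phi(x)) for x in grid)


if __name__ == "__main__":
    print("n, distance for fixed theta = 0.5, distance for theta_n = 1/sqrt(n)")
    for n in (1e2, 1e4, 1e6):
        mu = 1.0 / math.sqrt(n)  # assumed conservative tuning, m = 1
        fixed = sup_distance_to_normal(n, 0.5, mu)
        local = sup_distance_to_normal(n, 1.0 / math.sqrt(n), mu)
        print(int(n), round(fixed, 4), round(local, 4))
```

For fixed $\theta$ the distance shrinks with $n$, in line with the pointwise normal limit, whereas along $\theta_n = \nu/n^{1/2}$ it does not shrink at all, because $a$ and $b$ stay constant in $n$.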



Consistent case In this subsection we consider the case where the tuning
parameter $\mu_n$ is chosen so that $\hat\theta_A$ performs consistent model selection, i.e.,
$\mu_n \to 0$ and $n^{1/2}\mu_n \to \infty$.

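Theorem 5 Assume that $\mu_n \to 0$ and $n^{1/2}\mu_n \to \infty$. Suppose the true
parameter $\theta_n \in \mathbb{R}$ satisfies $\theta_n/\mu_n \to \zeta$ for some $\zeta \in \mathbb{R} \cup \{-\infty, \infty\}$.

1. If $\zeta = 0$ and $n^{1/2}\theta_n \to \nu \in \mathbb{R}$, then $F_{A,n,\theta_n}$ converges weakly to the cdf
$1(\cdot \ge -\nu)$.

2. The total mass of $F_{A,n,\theta_n}$ escapes to either $\infty$ or $-\infty$ for the following cases:
If $-\infty < \zeta < 0$, or if $\zeta = 0$ and $n^{1/2}\theta_n \to -\infty$, or if $\zeta = -\infty$ and
$n^{1/2}\mu_n^2/\theta_n \to -\infty$, then $F_{A,n,\theta_n}(x) \to 0$ for every $x \in \mathbb{R}$. If $0 < \zeta < \infty$,
or if $\zeta = 0$ and $n^{1/2}\theta_n \to \infty$, or if $\zeta = \infty$ and $n^{1/2}\mu_n^2/\theta_n \to \infty$, then
$F_{A,n,\theta_n}(x) \to 1$ for every $x \in \mathbb{R}$.

3. If $|\zeta| = \infty$ and $n^{1/2}\mu_n^2/\theta_n \to r \in \mathbb{R}$, then $F_{A,n,\theta_n}$ converges weakly to the
cdf $\Phi(\cdot + r)$.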
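
Parts 2 and 3 of Theorem 5 can be illustrated by evaluating the finite-sample cdf (9) along suitable sequences. The Python sketch below assumes the tuning sequence $\mu_n = n^{-1/4}$, for which $n^{1/2}\mu_n^2 = 1$; with $\theta_n = \mu_n/2$ (so $\zeta = 1/2$) the value $F_{A,n,\theta_n}(x)$ at any fixed $x$ eventually approaches one, while for fixed $\theta = 0.5$ (so $|\zeta| = \infty$ and $r = 2$) the cdf approaches $\Phi(x + 2)$. The tuning sequence, the evaluation points and the function names are assumptions made for this example.

```python
import math


def Phi(z):
    """Standard normal cdf, written with erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def cdf(x, n, theta, mu):
    """Finite-sample cdf (9) of sqrt(n)*(theta_hat_A - theta)."""
    rn = math.sqrt(n)
    centre = -(rn * theta - x) / 2.0
    half_width = math.sqrt(((rn * theta + x) / 2.0) ** 2 + n * mu**2)
    return Phi(centre + half_width) if rn * theta + x >= 0 else Phi(centre - half_width)


if __name__ == "__main__":
    x = -10.0  # a fixed evaluation point far in the left tail
    print("n, F(-10) along theta_n = mu_n/2, F(0) for theta = 0.5, Phi(0 + 2)")
    for n in (1e4, 1e6, 1e8):
        mu = n ** -0.25  # assumed consistent tuning: sqrt(n)*mu_n -> infinity
        escaping = cdf(x, n, 0.5 * mu, mu)   # Theorem 5, part 2 (0 < zeta < infinity)
        shifted = cdf(0.0, n, 0.5, mu)       # Theorem 5, part 3 with r = 1/0.5 = 2
        print(int(n), round(escaping, 4), round(shifted, 4), round(Phi(2.0), 4))
```

At $n = 10^4$ the value $F(-10)$ is still essentially zero, which shows that the escape of mass in part 2 is a statement about every fixed $x$ in the limit, not about any particular sample size.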

The fixed-parameter asymptotic behavior of the adaptive LASSO estimator
is obtained from Theorem 5 by setting $\theta_n \equiv \theta$. For $\theta = 0$, the asymptotic
distribution reduces to point-mass at 0, which coincides with the asymptotic
distribution of the restricted maximum likelihood estimator. In the case of
$\theta \neq 0$, the asymptotic distribution is $\Phi(x + \rho/\theta)$ provided $n^{1/2}\mu_n^2 \to \rho$ (with
the obvious interpretation if $|\rho| = \infty$). That is, it is a shifted version of the
asymptotic distribution of the unrestricted maximum likelihood estimator (the
shift being infinitely large if $|\rho| = \infty$). Observe that (for $|\rho| < \infty$) the shift gets
larger as $|\theta|$, $|\theta| \neq 0$, gets smaller. The 'oracle' property in the sense of Zou (2006)
is hence satisfied if and only if $\rho = 0$, that is, if the tuning parameter
additionally also satisfies $n^{1/4}\mu_n \to 0$. This is precisely the condition imposed
in Theorem 2 in Zou (2006) which establishes the 'oracle' property. [Note that
$n^{1/4}\mu_n \to 0$ translates into the assumption $\lambda_n/n^{1/2} \to 0$ in Theorem 2 in Zou
(2006).] If $n^{1/2}\mu_n^2 \to \rho \neq 0$, the adaptive LASSO estimator provides an example
of an estimator that performs consistent model selection, but does not satisfy
the 'oracle' property in the sense that for $\theta \neq 0$ its asymptotic distribution
does not coincide with the asymptotic distribution of the unrestricted maximum
likelihood estimator.

In any case, the 'oracle' property, which is guaranteed under the additional
requirement $n^{1/4}\mu_n \to 0$, carries little statistical meaning: Imposing the
additional condition $n^{1/4}\mu_n \to 0$ still allows all three cases in Theorem 5 above to
occur, showing that - notwithstanding the validity of the 'oracle' property -
non-normal limiting distributions arise under a moving-parameter asymptotic
framework. These latter distributions are in better agreement with the features
exhibited by the finite-sample distribution (9), whereas the 'oracle' property
always predicts a normal limiting distribution (a singular one in case $\theta = 0$),
showing that it does not capture essential features of the finite-sample
distribution. In particular, the preceding theorem shows that the estimator is
not uniformly $n^{1/2}$-consistent as the sequence of finite-sample distributions of
$n^{1/2}(\hat\theta_A - \theta_n)$ is stochastically unbounded in some cases arising in Theorem 5.
All this goes to show that the 'oracle' property, which is based on the pointwise
asymptotic distribution only, paints a highly misleading picture of the behavior
of the adaptive LASSO estimator and should not be taken at face value. See
also Remark 10.






It transpires from Theorem 5 that $F_{A,n,\theta_n}$ converges weakly to the singular
normal distribution $N(0, 0)$ if $\theta_n = 0$ for all $n$, and to the standard normal
$N(0, 1)$ if $\theta_n$ satisfies $|\theta_n|/(n^{1/2}\mu_n^2) \to \infty$. Hence, if one, for example, allows
as the parameter space for $\theta$ only the set $\Theta_n = \{\theta \in \mathbb{R} : \theta = 0 \text{ or } |\theta| > b_n\}$
where $b_n > 0$ satisfies $b_n/(n^{1/2}\mu_n^2) \to \infty$, then the convergence of $F_{A,n,\theta_n}$ to
the limiting distributions $N(0, 0)$ and $N(0, 1)$, respectively, is uniform over $\Theta_n$,
i.e., the 'oracle' property holds uniformly over $\Theta_n$. Does this line of reasoning
restore the credibility of the 'oracle' property? We do not think so for the
following reasons: The choice of $\Theta_n$ as the parameter space is highly artificial, and it
depends on sample size as well as on the tuning parameter (and hence on the
estimation procedure). Furthermore, in case $\Theta_n$ is adopted as the parameter
space, the 'forbidden' set $\mathbb{R} - \Theta_n$ will always have a diameter that is of order
larger than $n^{-1/2}$; in fact, it will always contain elements $\theta_n \neq 0$ such that
$\theta_n$ would be correctly classified as non-zero with probability converging to one
by the adaptive LASSO procedure used, i.e., $P_{n,\theta_n}(\hat\theta_A \neq 0) \to 1$ (to see this
note that $\mathbb{R} - \Theta_n$ contains elements $\theta_n$ satisfying $\theta_n/\mu_n \to \zeta$ with $|\zeta| > 1$ and
use Proposition 1). This shows that adopting $\Theta_n$ as the parameter space rules
out values of $\theta$ that are substantially different from zero, and not only values
of $\theta$ that are difficult to statistically distinguish from zero; consequently the
'forbidden' set is sizable. Summarizing, there appears to be little reason why
$\Theta_n$ would be a natural choice of parameter space, especially in the context of
model selection where interest naturally focusses on the neighborhood of zero.
We therefore believe that using $\Theta_n$ as the parameter space is hardly supported
by statistical reasoning but is more reflective of a search for conditions that are
favorable to the 'oracle' property.

As mentioned above, Theorem 5 shows, in particular, that $\hat\theta_A$ is not
uniformly $n^{1/2}$-consistent. This prompts the question of the behavior of the
distribution of $c_n(\hat\theta_A - \theta_n)$ under a sequence of norming constants $c_n$ that are
$o(n^{1/2})$. Inspection of the proof of Theorem 5 reveals that the stochastic
unboundedness phenomenon persists if $c_n$ is $o(n^{1/2})$ but is of order larger than $\mu_n^{-1}$.
For $c_n = O(\mu_n^{-1})$, we always have stochastic boundedness by Theorem 2. Hence,
the uniform convergence rate of $\hat\theta_A$ is seen to be $\mu_n$, which is slower than $n^{-1/2}$.
The precise limit distributions of the estimator under the scaling $c_n \sim \mu_n^{-1}$ are
obtained in the next theorem. [The case $c_n = o(\mu_n^{-1})$ is trivial since then these
limits are always pointmass at zero in view of Theorem 2.]^3 A consequence of
the next theorem is that with such a scaling the pointwise limiting distributions
always degenerate to pointmass at zero. This points to something of a dilemma
with the adaptive LASSO estimator when tuned to perform consistent model
selection: If we scale the estimator by $\mu_n^{-1}$, i.e., by the 'right' uniform rate, the
pointwise limiting distributions degenerate to pointmass at zero. If we scale the
estimator by $n^{1/2}$, which is the 'right' pointwise rate (at least if $n^{1/4}\mu_n \to 0$),
then we end up with stochastically unbounded sequences of distributions under
a moving parameter asymptotic framework (for certain sequences $\theta_n$).

^3 There is no loss in generality here in the sense that the general case where $c_n = O(\mu_n^{-1})$
holds can - by passing to subsequences - always be reduced to the cases where $c_n \sim \mu_n^{-1}$ or
$c_n = o(\mu_n^{-1})$ holds.

Let $G_{A,n,\theta}$ stand for the finite-sample distribution of $\mu_n^{-1}(\hat\theta_A - \theta)$ under
$P_{n,\theta}$. Clearly, $G_{A,n,\theta}(x) = F_{A,n,\theta}(n^{1/2}\mu_n x)$. The limits of this distribution
under 'moving parameter' asymptotics are given in the subsequent theorem. It
turns out that the limiting distributions are always pointmasses, however, not
always located at zero.

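Theorem 6 Assume that $\mu_n \to 0$, $n^{1/2}\mu_n \to \infty$, and that $\theta_n/\mu_n \to \zeta$ for some
$\zeta \in \mathbb{R} \cup \{-\infty, \infty\}$.

1. If $|\zeta| < 1$, then $G_{A,n,\theta_n}$ converges weakly to the cdf $1(\cdot \ge -\zeta)$.

2. If $1 \le |\zeta| < \infty$, then $G_{A,n,\theta_n}$ converges weakly to the cdf $1(\cdot \ge -1/\zeta)$.

3. If $|\zeta| = \infty$, then $G_{A,n,\theta_n}$ converges weakly to the cdf $1(\cdot \ge 0)$.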
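
The degenerate limits in Theorem 6 can be seen numerically from the relation $G_{A,n,\theta}(x) = F_{A,n,\theta}(n^{1/2}\mu_n x)$ together with (9). The Python sketch below evaluates $G_{A,n,\theta_n}$ just to the left and right of the claimed pointmass locations for a large $n$. The tuning sequence $\mu_n = n^{-1/4}$, the sample size $n = 10^8$, the chosen values of $\zeta$ and the function names are assumptions made for illustration.

```python
import math


def Phi(z):
    """Standard normal cdf, written with erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def cdf_F(x, n, theta, mu):
    """Finite-sample cdf (9) of sqrt(n)*(theta_hat_A - theta)."""
    rn = math.sqrt(n)
    centre = -(rn * theta - x) / 2.0
    half_width = math.sqrt(((rn * theta + x) / 2.0) ** 2 + n * mu**2)
    return Phi(centre + half_width) if rn * theta + x >= 0 else Phi(centre - half_width)


def cdf_G(x, n, theta, mu):
    """cdf of (theta_hat_A - theta)/mu_n, i.e. G(x) = F(sqrt(n)*mu_n*x)."""
    return cdf_F(math.sqrt(n) * mu * x, n, theta, mu)


if __name__ == "__main__":
    n = 1e8
    mu = n ** -0.25  # assumed consistent tuning sequence
    cases = [
        ("zeta = 0.25, pointmass at -0.25", 0.25 * mu, -0.3, -0.2),
        ("zeta = 2, pointmass at -0.5", 2.0 * mu, -0.6, -0.4),
        ("theta = 1 fixed, pointmass at 0", 1.0, -0.1, 0.1),
    ]
    for label, theta, left, right in cases:
        print(label, cdf_G(left, n, theta, mu), cdf_G(right, n, theta, mu))
```

In each case the cdf is numerically zero just left of the stated location and numerically one just right of it, consistent with a pointmass limit.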



3.3.3 Some Remarks 

Remark 7 Theorems 4 and 5 actually completely describe all accumulation
points of the finite-sample distribution of $n^{1/2}(\hat\theta_A - \theta_n)$ without any condition
on the sequence of parameters $\theta_n$. To see this, just apply the theorems to
subsequences and note that by compactness of $\mathbb{R} \cup \{-\infty, \infty\}$ we can select from
every subsequence a further subsequence such that the relevant quantities like
$n^{1/2}\theta_n$, $\theta_n/\mu_n$, and $n^{1/2}\mu_n^2/\theta_n$ converge in $\mathbb{R} \cup \{-\infty, \infty\}$ along this further
subsequence. A similar comment also applies to Theorem 6.

Remark 8 As a point of interest we note that the full complexity of the pos- 
sible limiting distributions in Theorems IH [H and [H] already arises if we restrict 
the sequences 0„ to a bounded neighborhood of zero. Hence, the phenomena 
described by the above theorems are of a local nature, and are not tied in any 
way to the unboundedness of the parameter space. 

Remark 9 In case the estimator is tuned to perform consistent model selec- 
tion, it is mainly the behavior of 0n//^„ that governs the form of the limiting 
distributions in Theorems [5] and El Note that On/Hn is of smaller order than 
n^/'^On because — > cx) in the consistent case. Hence, an analysis rely- 
ing only on the classical local asymptotics based on perturbations of of the 
order of n~^/^ does not properly reveal all possible limits of the finite-sample 
distributions in that case. [This is in contrast to the conservative case, where 
classical local asymptotics reveal all possible limit distributions.] 

Remark 10 The mathematical reason for the failure of the pointwise asymptotic
distributions to capture the behavior of the finite-sample distributions well
is that the convergence of the latter to the former is not uniform in the
underlying parameter $\theta \in \mathbb{R}$. See Leeb & Pötscher (2003, 2005) for more discussion
in the context of post-model-selection estimators.

Remark 11 The theoretical analysis has been restricted to the case of orthogonal
regressors. In the case of correlated regressors we can expect to see similar
phenomena (e.g., non-normality of finite-sample cdfs, non-uniformity problems,
etc.), although details will be different. Evidence for this is provided by the
simulation study presented in Section 4, by corresponding theoretical results
for a class of post-model-selection estimators (Leeb & Pötscher (2003, 2006a,
2008b)) in the correlated regressor case, as well as by general results on
estimators possessing the sparsity property (Leeb & Pötscher (2008a), Pötscher
(2003)).



3.4 Impossibility results for estimating the distribution of 
the adaptive LASSO 

Since the cdf $F_{A,n,\theta}$ of $n^{1/2}(\hat\theta_A - \theta)$ depends on the unknown parameter, as
shown in Section 3.3.1, one might be interested in estimating this cdf. We show
that this is an intrinsically difficult estimation problem in the sense that the
cdf cannot be estimated in a uniformly consistent fashion. In the following,
we provide large-sample results that cover both consistent and conservative
choices of the tuning parameter, as well as finite-sample results that hold for
any choice of tuning parameter. For related results in different contexts see
Leeb & Pötscher (2006a,b, 2008a), Pötscher (2006), Pötscher & Leeb (2007).

It is straightforward to construct consistent estimators for the distribution
$F_{A,n,\theta}$ of the (centered and scaled) estimator $\hat\theta_A$. One popular choice is to use
subsampling or the $m$ out of $n$ bootstrap with $m/n \to 0$. Another possibility
is to use the pointwise large-sample limit distributions derived in Section 3.3.2
together with a properly chosen pre-test of the hypothesis $\theta = 0$ versus $\theta \neq 0$.
Because the pointwise large-sample limit distribution takes only two different
functional forms depending on whether $\theta = 0$ or $\theta \neq 0$, one can perform a
pre-test that rejects the hypothesis $\theta = 0$ in case $|\bar y| > n^{-1/4}$, say, and estimate the
finite-sample distribution by that large-sample limit formula that corresponds
to the outcome of the pre-test;* the test's critical value $n^{-1/4}$ ensures that the
correct large-sample limit formula is selected with probability approaching one
as sample size increases. However, as we show next, any consistent estimator of
the cdf $F_{A,n,\theta}$ is necessarily badly behaved in a worst-case sense.
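
The role of the critical value $n^{-1/4}$ can be illustrated by computing the rejection probability of the pre-test exactly. The sketch below assumes that $\bar y \sim N(\theta, 1/n)$, as in a Gaussian orthogonal-design setting with unit error variance; this distributional assumption and the function names are introduced here for illustration only. For fixed $\theta$ the pre-test picks the right regime with probability close to one, but when $\theta$ equals the critical value the rejection probability is about 1/2, a first hint at the lack of uniformity.

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def reject_prob(theta, n):
    """P(|ybar| > n^{-1/4}) when ybar ~ N(theta, 1/n) (assumed model)."""
    c = n ** (-0.25)          # critical value of the pre-test
    r = math.sqrt(n)
    return 1.0 - Phi(r * (c - theta)) + Phi(r * (-c - theta))

n = 10 ** 4
p_zero = reject_prob(0.0, n)             # theta = 0: rejection is rare
p_fixed = reject_prob(1.0, n)            # fixed theta != 0: rejection is almost sure
p_local = reject_prob(n ** (-0.25), n)   # theta at the critical value: about 1/2
print(p_zero, p_fixed, p_local)
```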



Theorem 12 Let $\mu_n$ be a sequence of tuning parameters such that $\mu_n \to 0$ and
$n^{1/2}\mu_n \to m$ with $0 \leq m \leq \infty$. Let $t \in \mathbb{R}$ be arbitrary. Then every consistent
estimator $\hat F_n(t)$ of $F_{A,n,\theta}(t)$ satisfies

$$\lim_{n\to\infty} \sup_{|\theta| < c/n^{1/2}} P_{n,\theta}\left( \left| \hat F_n(t) - F_{A,n,\theta}(t) \right| > \varepsilon \right) = 1$$

for each $\varepsilon < (\Phi(t+m) - \Phi(t-m))/2$ and each $c > |t|$. In particular, no uniformly
consistent estimator for $F_{A,n,\theta}(t)$ exists.



*In the conservative case, the asymptotic distribution can also depend on $m$, which is then
to be replaced by $n^{1/2}\mu_n$.

We stress that the above result also applies to any kind of bootstrap- or
subsampling-based estimator of the cdf $F_{A,n,\theta}$ whatsoever, since the results in
Leeb & Pötscher (2006b) on which the proof of Theorem 12 rests apply to
arbitrary randomized estimators, cf. Lemma 3.6 in Leeb & Pötscher (2006b).
The same applies to Theorems 13 and 14 that follow.

Loosely speaking, Theorem 12 states that any consistent estimator for the
cdf $F_{A,n,\theta}$ suffers from an unavoidable worst-case error of at least $\varepsilon$ with
$\varepsilon < (\Phi(t+m) - \Phi(t-m))/2$. The error range, i.e., $(\Phi(t+m) - \Phi(t-m))/2$, is
governed by the limit $m = \lim_n n^{1/2}\mu_n$. In case the estimator is tuned to be
consistent, i.e., in case $m = \infty$, the error range equals 1/2, and the phenomenon
is most pronounced. If the estimator is tuned to be conservative so that $m < \infty$,
the error range is less than 1/2 but can still be substantial. Only in case $m = 0$
the error range equals zero, and the condition $\varepsilon < (\Phi(t+m) - \Phi(t-m))/2$
in Theorem 12 leads to a trivial conclusion. This is, however, not surprising
as then the resulting estimator is uniformly asymptotically equivalent to the
unrestricted maximum likelihood estimator $\bar y$, cf. Remark 3.
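
To get a feeling for the size of the error range $(\Phi(t+m) - \Phi(t-m))/2$, it can simply be tabulated. The following sketch (the helper name `error_range` and the chosen values of $t$ and $m$ are illustrative assumptions) evaluates it at $t = 0$ for several values of $m$, including the limiting cases $m = 0$ and $m = \infty$.

```python
import math

def Phi(z):
    """Standard normal cdf; math.erf handles +/- infinity."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def error_range(t, m):
    """(Phi(t + m) - Phi(t - m)) / 2 for 0 <= m <= infinity."""
    return (Phi(t + m) - Phi(t - m)) / 2.0

for m in (0.0, 0.5, 1.0, 3.0, math.inf):
    print(m, round(error_range(0.0, m), 4))
```

Already for $m = 1$ the range at $t = 0$ is roughly 0.34, which illustrates that the conservative case is affected as well.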

A similar non-uniformity phenomenon as described in Theorem 12 for consistent
estimators $\hat F_n(t)$ also occurs for not necessarily consistent estimators. For
such arbitrary estimators we find in the following that the phenomenon can be
somewhat less pronounced, in the sense that the lower bound is now 1/2 instead
of 1, cf. (13) below. The following theorem gives a large-sample limit result that
parallels Theorem 12, as well as a finite-sample result, both for arbitrary (and
not necessarily consistent) estimators of the cdf.

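Theorem 13 Let $0 < \mu_n < \infty$ and let $t \in \mathbb{R}$ be arbitrary. Then every estimator
$\hat F_n(t)$ of $F_{A,n,\theta}(t)$ satisfies

$$\sup_{|\theta| < c/n^{1/2}} P_{n,\theta}\left( \left| \hat F_n(t) - F_{A,n,\theta}(t) \right| > \varepsilon \right) \geq \frac{1}{2} \qquad (12)$$

for each $\varepsilon < (\Phi(t + n^{1/2}\mu_n) - \Phi(t - n^{1/2}\mu_n))/2$, for each $c > |t|$, and for each
fixed sample size $n$. If $\mu_n$ satisfies $\mu_n \to 0$ and $n^{1/2}\mu_n \to m$ as $n \to \infty$ with
$0 \leq m \leq \infty$, we thus have

$$\liminf_{n\to\infty} \inf_{\hat F_n(t)} \sup_{|\theta| < c/n^{1/2}} P_{n,\theta}\left( \left| \hat F_n(t) - F_{A,n,\theta}(t) \right| > \varepsilon \right) \geq \frac{1}{2} \qquad (13)$$

for each $\varepsilon < (\Phi(t+m) - \Phi(t-m))/2$ and for each $c > |t|$, where the infimum
in (13) extends over all estimators $\hat F_n(t)$.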

The finite-sample statement in Theorem 13 clearly reveals how the estimability
of the cdf of the estimator depends on the tuning parameter $\mu_n$: A larger
value of $\mu_n$, which results in a 'more sparse' estimator in view of (5), directly
corresponds to a larger range $(\Phi(t + n^{1/2}\mu_n) - \Phi(t - n^{1/2}\mu_n))/2$ for the error $\varepsilon$
within which any estimator $\hat F_n(t)$ performs poorly in the sense of (12). In large
samples, the limit $m = \lim_{n\to\infty} n^{1/2}\mu_n$ takes the role of $n^{1/2}\mu_n$.

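An impossibility result paralleling Theorem 13 for the cdf $G_{A,n,\theta}(t)$ of
$\mu_n^{-1}(\hat\theta_A - \theta)$ is given next.

Theorem 14 Let $0 < \mu_n < \infty$ and let $t \in \mathbb{R}$ be arbitrary. Then every estimator
$\hat G_n(t)$ of $G_{A,n,\theta}(t)$ satisfies

$$\sup_{|\theta| < c\mu_n} P_{n,\theta}\left( \left| \hat G_n(t) - G_{A,n,\theta}(t) \right| > \varepsilon \right) \geq \frac{1}{2} \qquad (14)$$

for each $\varepsilon < (\Phi(n^{1/2}\mu_n(t+1)) - \Phi(n^{1/2}\mu_n(t-1)))/2$, for each $c > |t|$, and for
each fixed sample size $n$. If $\mu_n$ satisfies $\mu_n \to 0$ and $n^{1/2}\mu_n \to \infty$ as $n \to \infty$,
we thus have for each $c > |t|$

$$\liminf_{n\to\infty} \inf_{\hat G_n(t)} \sup_{|\theta| < c\mu_n} P_{n,\theta}\left( \left| \hat G_n(t) - G_{A,n,\theta}(t) \right| > \varepsilon \right) \geq \frac{1}{2} \qquad (15)$$

for each $\varepsilon < 1/2$ if $|t| < 1$ and for each $\varepsilon < 1/4$ if $|t| = 1$, where the infimum in
(15) extends over all estimators $\hat G_n(t)$.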

This result shows, in particular, that no uniformly consistent estimator exists
for $G_{A,n,\theta}(t)$ in case $|t| \leq 1$ (not even over compact subsets of $\mathbb{R}$
containing the origin). In view of Theorem 6 we see that for $t > 1$ we have
$\sup_{\theta\in\mathbb{R}} |G_{A,n,\theta}(t) - 1| \to 0$ as $n \to \infty$, hence $\hat G_n(t) = 1$ is trivially a uniformly
consistent estimator in this case. Similarly, for $t < -1$ we have $\sup_{\theta\in\mathbb{R}} |G_{A,n,\theta}(t)| \to 0$
as $n \to \infty$, hence $\hat G_n(t) = 0$ is trivially a uniformly consistent estimator in
this case.
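
The three regimes $|t| < 1$, $|t| = 1$, and $|t| > 1$ can be read off from the finite-sample bound $(\Phi(n^{1/2}\mu_n(t+1)) - \Phi(n^{1/2}\mu_n(t-1)))/2$ in Theorem 14 when $n^{1/2}\mu_n$ is large. The sketch below is a plain evaluation of that expression; the helper name `bound_14` and the value used for $n^{1/2}\mu_n$ are assumptions chosen for illustration.

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bound_14(t, scale):
    """(Phi(scale*(t+1)) - Phi(scale*(t-1))) / 2, with scale = n^{1/2} mu_n."""
    return (Phi(scale * (t + 1.0)) - Phi(scale * (t - 1.0))) / 2.0

scale = 1000.0   # stands in for a large value of n^{1/2} mu_n
for t in (0.0, 0.9, 1.0, 1.5):
    print(t, bound_14(t, scale))
```

The printed values are (numerically) 1/2 for $|t| < 1$, 1/4 at $t = 1$, and 0 for $t > 1$, in line with the conditions on $\varepsilon$ in (15) and with the trivial estimators available for $|t| > 1$.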



4 Some Monte Carlo Results 

We provide simulation results for the finite-sample distribution of the adaptive
LASSO estimator in the case of non-orthogonal regressors to complement our
theoretical findings for the orthogonal case. We present our results by showing
the marginal distribution for each component of the centered and scaled estimator.
Not surprisingly, the graphs exhibit the same highly non-normal features as
the corresponding finite-sample distribution of the estimator derived in Section
3.3 for the case of orthogonal regressors.

The simulations were carried out in the following way. We consider 1000 repetitions
of $n$ simulated data points from the model (1) with $\sigma^2 = 1$ and $X$ such
that $X'X = n\Omega$ with $\Omega_{ij} = 0.5^{|i-j|}$ for $i,j = 1, \dots, k$. More concretely, $X$ was
partitioned into $d = n/k$ blocks of size $k \times k$ (where $d$ is assumed to be integer)
and each of these blocks was set equal to $k^{1/2}L'$, with $LL' = \Omega$ the Cholesky
factorization of $\Omega$. We used $k = 4$ regressors and various values of the true
parameter $\theta$ given by $\theta = (3, 1.5, \gamma n^{-1/2}, \gamma n^{-1/2})'$ where $\gamma = 0, 1, 2$. This model
with $\theta = (3, 1.5, 0, 0)'$ (i.e., $\gamma = 0$) is a downsized version of a model considered
in Monte Carlo studies in Tibshirani (1996), Fan & Li (2001), and Zou (2006).
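
The construction of the design matrix can be sketched in a few lines. The code below is an illustration in plain Python rather than the R code used for the study; the function names are hypothetical. It stacks $d = n/k$ copies of the block $k^{1/2}L'$, where $LL' = \Omega$ (the transpose is what makes $X'X = n\Omega$ hold), and then checks the Gram matrix.

```python
import math

def omega(k, rho=0.5):
    """Omega_ij = rho^{|i-j|}."""
    return [[rho ** abs(i - j) for j in range(k)] for i in range(k)]

def cholesky(a):
    """Lower-triangular L with L L' = a (a symmetric positive definite)."""
    k = len(a)
    L = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(L[i][p] * L[j][p] for p in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def design(n, k=4):
    """Stack d = n/k copies of the k x k block sqrt(k) * L'."""
    assert n % k == 0, "d = n/k is assumed to be an integer"
    L = cholesky(omega(k))
    block = [[math.sqrt(k) * L[j][i] for j in range(k)] for i in range(k)]
    return [row[:] for _ in range(n // k) for row in block]

def gram(X):
    """X'X for a matrix stored as a list of rows."""
    k = len(X[0])
    return [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]

X = design(100)
XtX = gram(X)   # should equal 100 * Omega up to rounding
```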



For apparent reasons it is of interest to investigate the performance of the
estimator not only at a single parameter value, but also at other (neighboring)
points in the parameter space. The cases with $\gamma \neq 0$ represent the statistically
interesting case where some components of the true parameter value are close
to but not equal to zero.

For each simulation, we computed the adaptive LASSO estimator $\hat\theta_A$ using
the LARS package of Efron et al. (2004) in R. Each component of the estimator
was centered and scaled, i.e., $C_{jj}^{-1/2}(\hat\theta_{A,j} - \theta_j)$ was computed, where
$C = (n\Omega)^{-1}$. The tuning parameter $\mu_n$ was chosen in two different ways. In
the first case, it was set to the fixed value of $\mu_n = n^{-1/3}$, a choice that
corresponds to consistent model selection and additionally satisfies the condition
$n^{1/4}\mu_n \to 0$ required in Zou (2006) to obtain the 'oracle' property. In the
second case, in each simulation the tuning parameter was selected to minimize a
mean-squared prediction error obtained through $K$-fold cross-validation (which
can be computed using the LARS package, in our case with $K = 10$).

The results for both choices of the tuning parameters, for $n = 100$, and
$\gamma = 0, 1, 2$ are shown in Figures 2-7 below. For each component of the estimator,
the discrete component of the distribution corresponding to the zero values
of the $j$-th component of the estimator $\hat\theta_{A,j}$ (appearing at $-C_{jj}^{-1/2}\theta_j$ for the
centered and scaled estimator) is represented by a dot drawn at the height of
the corresponding relative frequency. The histogram formed from the remaining
values of $C_{jj}^{-1/2}(\hat\theta_{A,j} - \theta_j)$ was then smoothed by the kernel smoother available
in R, resulting in the curves representing the density of the absolutely continuous
part of the finite-sample distribution of $C_{jj}^{-1/2}(\hat\theta_{A,j} - \theta_j)$. Naturally, in
these plots the density was rescaled by the appropriate relative frequency of the
estimator not being equal to zero.
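
The split into an atom and an absolutely continuous part can be reproduced in a much simpler setting. The sketch below is not the simulation design described above: it assumes the scalar Gaussian location case with $\bar y \sim N(\theta, 1/n)$ and the closed form $\hat\theta_A = \bar y\,(1 - \mu_n^2/\bar y^2)_+$ of the estimator under orthogonality, and all function names are hypothetical. It records the relative frequency of exact zeros (the height of the dot) and keeps the remaining centered and scaled values, which would then be passed to a kernel smoother and rescaled by one minus that frequency.

```python
import math
import random

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def adaptive_lasso_scalar(ybar, mu):
    """Closed form in the scalar orthogonal case (assumption of this sketch)."""
    if abs(ybar) <= mu:
        return 0.0
    return ybar - mu * mu / ybar

def simulate(theta, n, mu, reps=20000, seed=1):
    """Return (relative frequency of zeros, list of nonzero scaled values)."""
    rng = random.Random(seed)
    zeros, rest = 0, []
    for _ in range(reps):
        ybar = theta + rng.gauss(0.0, 1.0) / math.sqrt(n)
        est = adaptive_lasso_scalar(ybar, mu)
        if est == 0.0:
            zeros += 1
        else:
            rest.append(math.sqrt(n) * (est - theta))
    return zeros / reps, rest

n = 100
mu = n ** (-1.0 / 3.0)
theta = 1.0 / math.sqrt(n)           # analogue of gamma = 1
freq, rest = simulate(theta, n, mu)
# exact mass of the atom: P(|ybar| <= mu) under the assumed model
exact = Phi(math.sqrt(n) * (mu - theta)) - Phi(math.sqrt(n) * (-mu - theta))
print(freq, exact, len(rest))
```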

We first discuss the case where the tuning parameter is set at the fixed value
$\mu_n = n^{-1/3}$. For $\gamma = 0$, i.e., the case where the last two components of the
true parameter are identically zero, Figure 2 shows that the adaptive LASSO
estimator finds the zero components in $\theta = (3, 1.5, 0, 0)'$ with probability close
to one (i.e., the distributions of $C_{jj}^{-1/2}(\hat\theta_{A,j} - \theta_j)$, $j = 3, 4$, practically coincide
with pointmass at 0). Furthermore, the distributions of the first two components
seem to somewhat resemble normality. The outcome in this case is hence roughly
in line with what the 'oracle' property predicts. This is due to the fact that the
components of $\theta$ are either zero or large (note that $C_{jj}^{-1/2}\theta_j$ is approximately
equal to 26 and 12, respectively, for $j = 1, 2$). The results are quite different
for the cases $\gamma = 1$ and $\gamma = 2$ (Figures 3 and 4), which represent the case
where some of the components of the parameter vector $\theta$ are large and some are
different from zero but small (note that $C_{33}^{-1/2}\theta_3 \approx 0.77\gamma$ and $C_{44}^{-1/2}\theta_4 \approx 0.87\gamma$).
In both cases the distributions of $C_{jj}^{-1/2}(\hat\theta_{A,j} - \theta_j)$, $j = 3, 4$, are a mixture of an
atomic part and an absolutely continuous part, both shifted to the left of the
origin. Furthermore, the absolutely continuous part appears to be highly
non-normal. This is perfectly in line with the theoretical results obtained in Section
3.3. It once again demonstrates that the 'oracle' property gives a misleading
impression of the actual performance of the estimator.
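
The magnitudes quoted above for $C_{jj}^{-1/2}\theta_j$ follow directly from $C = (n\Omega)^{-1}$ with $\Omega_{ij} = 0.5^{|i-j|}$ and $n = 100$. The following sketch recomputes them; the small Gauss-Jordan inverse and the function names are illustrative choices and are not taken from the original study.

```python
import math

def omega(k, rho=0.5):
    """Omega_ij = rho^{|i-j|}."""
    return [[rho ** abs(i - j) for j in range(k)] for i in range(k)]

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting for a small matrix."""
    k = len(a)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(a)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def scaled_theta(theta, n):
    """C_jj^{-1/2} * theta_j with C = (n * Omega)^{-1}."""
    k = len(theta)
    om_inv = inverse(omega(k))
    # C_jj = (Omega^{-1})_jj / n
    return [theta[j] / math.sqrt(om_inv[j][j] / n) for j in range(k)]

n, gamma = 100, 1.0
theta = [3.0, 1.5, gamma / math.sqrt(n), gamma / math.sqrt(n)]
vals = scaled_theta(theta, n)
print([round(v, 2) for v in vals])
```

With $\gamma = 1$ the four values are roughly 26, 12, 0.77, and 0.87, which explains why the first two components behave as the 'oracle' property suggests while the last two do not.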

In the case where the tuning parameter is chosen by cross-validation, a
similar picture emerges, except for the fact that in case $\gamma = 0$ the adaptive
LASSO estimator now finds the zero component less frequently, cf. Figure 5.
[In fact, the probability of finding a zero value of $\hat\theta_{A,j}$ for $j = 3, 4$ is smaller in the
cross-validated case regardless of the value of $\gamma$ considered.] The reason for this
is that the tuning parameters obtained through cross-validation were typically
found to be smaller than $n^{-1/3}$, resulting in an estimator $\hat\theta_A$ that acts more
like a conservative rather than a consistent model selection procedure. [This is
in line with theoretical results in Leng et al. (2006), see also Leeb & Pötscher
(2008a).] In agreement with the theoretical results in Section 3.3, the absolutely
continuous components of the distributions of $C_{jj}^{-1/2}(\hat\theta_{A,j} - \theta_j)$ are now typically
highly non-normal, especially for $j = 3, 4$, cf. Figures 5-7. [Note that
cross-validation leads to a data-dependent tuning parameter $\mu_n$, a situation that is
strictly speaking not covered by the theoretical results.]

Figure 2: Marginal distributions of the scaled and centered adaptive LASSO
estimator for $n = 100$, $\gamma = 0$, i.e., $\theta = (3, 1.5, 0, 0)'$, and $\mu_n = n^{-1/3} = 0.22$.

Figure 3: Marginal distributions of the scaled and centered adaptive LASSO
estimator for $n = 100$, $\gamma = 1$, i.e., $\theta = (3, 1.5, 0.1, 0.1)'$, and $\mu_n = n^{-1/3} = 0.22$.

Figure 4: Marginal distributions of the scaled and centered adaptive LASSO
estimator for $n = 100$, $\gamma = 2$, i.e., $\theta = (3, 1.5, 0.2, 0.2)'$, and $\mu_n = n^{-1/3} = 0.22$.

We have also experimented with other values of $\theta$ such as $\theta = (3, 1.5, \gamma n^{-1/2}, 0)'$
or $\theta = (3, 1.5, 0, \gamma n^{-1/2})'$, other values of $\gamma$, and other sample sizes such as $n = 60$
or 200. The results were found to be qualitatively the same.

Figure 5: Marginal distributions of the scaled and centered adaptive LASSO
estimator for $n = 100$, $\gamma = 0$, i.e., $\theta = (3, 1.5, 0, 0)'$, and $\mu_n$ chosen by
cross-validation.

Figure 6: Marginal distributions of the scaled and centered adaptive LASSO
estimator for $n = 100$, $\gamma = 1$, i.e., $\theta = (3, 1.5, 0.1, 0.1)'$, and $\mu_n$ chosen by
cross-validation.

Figure 7: Marginal distributions of the scaled and centered adaptive LASSO
estimator for $n = 100$, $\gamma = 2$, i.e., $\theta = (3, 1.5, 0.2, 0.2)'$, and $\mu_n$ chosen by
cross-validation.



5 Conclusion

We have studied the distribution of the adaptive LASSO estimator, a penalized
least squares estimator introduced in Zou (2006), in finite samples as well as in
the large-sample limit. The theoretical study assumes an orthogonal regression
model. The finite-sample distribution was found to be a mixture of a singular
normal distribution and an absolutely continuous distribution, which is
non-normal. The large-sample limit of the distributions depends on the choice of
the estimator's tuning parameter, and we can distinguish two cases:

In the first case the tuning is such that the estimator acts as a conserva- 
tive model selector. In this case, the adaptive LASSO estimator is found to 
be uniformly $n^{1/2}$-consistent. We also show that fixed-parameter asymptotics
(where the true parameter remains fixed while sample size increases) only partially
reflect the actual behavior of the distribution whereas 'moving parameter'
asymptotics (where the true parameter may depend on sample size) give a more
accurate picture. The moving-parameter analysis shows that the distribution 
may be highly non-normal irrespective of sample size, in particular, in the sta- 
tistically interesting case where the true parameter is close (in an appropriate 
sense) to a lower-dimensional submodel. This also implies that the finite-sample 
phenomena that we have observed can occur at any sample size. 

In the second case, where the estimator is tuned to perform consistent model 
selection, again fixed-parameter asymptotics do not capture the whole range of 
large-sample phenomena that can occur. With 'moving parameter' asymptotics, 
we have shown that the distribution of these estimators can again be highly non- 
normal, even in large samples. In addition, we have found that the observed 
finite-sample phenomena not only can persist but actually can be more pronounced
for larger sample sizes. For example, the distribution of the estimator
(properly centered and scaled by $n^{1/2}$) can diverge in the sense that all its mass
escapes to either $+\infty$ or $-\infty$. In fact, we have established that the uniform
convergence rate of the adaptive LASSO estimator is slower than $n^{-1/2}$ in the
consistent model selection case. These findings are especially important as the
adaptive LASSO estimator has been shown in Zou (2006) to possess an 'oracle'
property (under an additional assumption on the tuning parameter), which
promises a convergence rate of $n^{-1/2}$ and a normal distribution in large samples.
However, the 'oracle' property is based on a fixed-parameter asymptotic
argument which, as our results show, gives highly misleading results.

The findings mentioned above are based on a theoretical analysis (Section
3) of the adaptive LASSO estimator in an orthogonal linear regression model.
The orthogonality restriction is removed in the Monte Carlo analysis in Section
4. The results from this simulation study confirm the theoretical results.

Finally, we have studied the problem of estimating the cdf of the (centered 
and scaled) adaptive LASSO estimator. We have shown that this cdf cannot be 
estimated in a uniformly consistent fashion, even though pointwise consistent 
estimators can be constructed with relative ease. 

We would like to stress that our results should not be read as a condemnation 
of the adaptive LASSO estimator, but as a warning that the distributional 
properties of this estimator are quite intricate and complex. 



A Appendix 



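Proof of Theorem 2. Since (7) implies (6), it suffices to prove the former.
For this, it is instructive to write $\hat\theta_A$ in terms of the hard-thresholding estimator
$\hat\theta_H$ as defined in Pötscher & Leeb (2007) (with $\eta_n = \mu_n$) by observing that

$$\hat\theta_A = \hat\theta_H - \operatorname{sign}(\hat\theta_H)\,\mu_n^2/|\bar y|.$$

Here $\operatorname{sign}(x) = -1, 0, 1$ depending on whether $x < 0$, $= 0$, $> 0$. Since $\hat\theta_H$ satisfies
(7) as is shown in Theorem 2 in Pötscher & Leeb (2007), it suffices to consider

$$\sup_{\theta} P_{n,\theta}\left(a_n |\hat\theta_H - \hat\theta_A| > M\right) = \sup_{\theta} P_{n,\theta}\left(a_n \mu_n^2/|\bar y| > M,\ \hat\theta_H \neq 0\right)$$
$$= \sup_{\theta} P_{n,\theta}\left(a_n \mu_n^2/|\bar y| > M,\ |\bar y| > \mu_n\right)$$
$$\leq 1(a_n \mu_n > M).$$

Since $a_n \mu_n \leq 1$, the right-hand side in the above expression equals zero for any
$M > 1$. ■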
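
The representation of $\hat\theta_A$ through the hard-thresholding estimator, and the resulting bound $a_n|\hat\theta_H - \hat\theta_A| \leq a_n\mu_n \leq 1$, can be verified numerically on a grid of values for $\bar y$. The sketch below assumes the closed forms $\hat\theta_H = \bar y\,1(|\bar y| > \mu_n)$ and $\hat\theta_A = \bar y - \mu_n^2/\bar y$ for $|\bar y| > \mu_n$ (zero otherwise), as well as $a_n = \min(n^{1/2}, \mu_n^{-1})$; these definitions are stated outside this section and are taken here as assumptions, and the function names are hypothetical.

```python
import math

def hard_threshold(ybar, mu):
    """theta_hat_H = ybar * 1(|ybar| > mu)."""
    return ybar if abs(ybar) > mu else 0.0

def adaptive_lasso(ybar, mu):
    """theta_hat_A in the orthogonal case (assumed closed form)."""
    return ybar - mu * mu / ybar if abs(ybar) > mu else 0.0

def sign(x):
    return (x > 0) - (x < 0)

n = 100
mu = n ** (-1.0 / 3.0)
a_n = min(math.sqrt(n), 1.0 / mu)       # assumed definition of a_n

max_identity_gap = 0.0                  # checks the representation of theta_hat_A
max_scaled_diff = 0.0                   # checks a_n |theta_hat_H - theta_hat_A| <= 1
for i in range(-3000, 3001):
    y = i / 1000.0
    th, ta = hard_threshold(y, mu), adaptive_lasso(y, mu)
    corr = sign(th) * mu * mu / abs(y) if th != 0.0 else 0.0
    max_identity_gap = max(max_identity_gap, abs(ta - (th - corr)))
    max_scaled_diff = max(max_scaled_diff, a_n * abs(th - ta))
print(max_identity_gap, max_scaled_diff)
```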

Proposition 15 Let $\theta_n \in \mathbb{R}$ and $0 < \mu_n < \infty$. If $\theta_n/\mu_n \to -\infty$ and $n^{1/2}\theta_n \to -\infty$,
then $z^{(1)}_{n,\theta_n}(x) - x \sim n^{1/2}\mu_n^2/\theta_n$ as $n \to \infty$ for every $x \in \mathbb{R}$. If $\theta_n/\mu_n \to \infty$
and $n^{1/2}\theta_n \to \infty$, then $z^{(2)}_{n,\theta_n}(x) - x \sim n^{1/2}\mu_n^2/\theta_n$ for every $x \in \mathbb{R}$.

Proof. We prove the first claim. We can write

$$z^{(1)}_{n,\theta_n}(x) - x = -(n^{1/2}\theta_n + x)/2 - \sqrt{((n^{1/2}\theta_n + x)/2)^2 + n\mu_n^2}$$
$$= n^{1/2}a_n(x)\left\{-1 + \sqrt{1 + (\mu_n/a_n(x))^2}\right\}$$

with $n^{1/2}a_n(x) = (n^{1/2}\theta_n + x)/2$, where the last equality holds for large $n$ since
$n^{1/2}a_n(x) < 0$ eventually. Through an expansion of $\sqrt{1+z}$ about zero, we
obtain

$$z^{(1)}_{n,\theta_n}(x) - x = n^{1/2}(\mu_n^2/a_n(x))(1 + z_n)^{-1/2}/2$$
$$= (n^{1/2}\mu_n^2/\theta_n)(1 + x/(n^{1/2}\theta_n))^{-1}(1 + z_n)^{-1/2},$$

with $0 \leq z_n \leq (\mu_n/a_n(x))^2$. Note that $\mu_n/a_n(x) = 2(\mu_n/\theta_n)(1 + x/(n^{1/2}\theta_n))^{-1} \to 0$,
and hence $z_n \to 0$ holds. The claim now follows. The second claim is proved
analogously. ■
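
The asymptotic equivalence in the first claim can be illustrated numerically. The sketch below evaluates the expression for $z^{(1)}_{n,\theta_n}(x) - x$ from the proof and compares it with the leading term $n^{1/2}\mu_n^2/\theta_n$; the choices $\theta_n = -1$ and $\mu_n = n^{-1/3}$ (so that $\theta_n/\mu_n \to -\infty$ and $n^{1/2}\theta_n \to -\infty$) and the function names are assumptions made for this example.

```python
import math

def z1_minus_x(x, n, mu, theta):
    """-(n^{1/2} theta + x)/2 - sqrt(((n^{1/2} theta + x)/2)^2 + n mu^2)."""
    a = (math.sqrt(n) * theta + x) / 2.0
    return -a - math.sqrt(a * a + n * mu * mu)

def leading_term(n, mu, theta):
    """n^{1/2} mu^2 / theta."""
    return math.sqrt(n) * mu * mu / theta

theta, x = -1.0, 1.0
ratios = []
for n in (10 ** 4, 10 ** 6, 10 ** 8):
    mu = n ** (-1.0 / 3.0)
    ratios.append(z1_minus_x(x, n, mu, theta) / leading_term(n, mu, theta))
print(ratios)   # the ratios should approach 1 as n grows
```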

Proof of Theorem 4. We derive the corresponding asymptotic distributions
by studying the limit behavior of (9) with $\theta$ replaced by $\theta_n$. If $\nu \in \mathbb{R}$ the
result immediately follows, since $F_{A,n,\theta_n}(x)$ converges to the limit given above
for every $x \neq -\nu$ as a consequence of (9) and $n^{1/2}\theta_n \to \nu$. For the case $\nu = \infty$,
note that the indicator function of the first term in (9) goes to 1 for every $x \in \mathbb{R}$,
whereas the second one goes to 0. Furthermore, we clearly have $\theta_n/\mu_n \to \infty$
since $0 \leq m < \infty$ holds. Therefore we can apply Proposition 15 to find that
$z^{(2)}_{n,\theta_n}(x) \to x$ since $n^{1/2}\mu_n^2/\theta_n = n^{1/2}\mu_n(\mu_n/\theta_n) \to m \cdot 0 = 0$. This implies that
$F_{A,n,\theta_n}(x) \to \Phi(x)$ for all $x \in \mathbb{R}$ in case $\nu = \infty$. A similar argument can be made
to prove the claim for $\nu = -\infty$. ■

Proof of Theorem 5. If $|\zeta| < 1$, Proposition 1 shows that the total mass
of the atomic part (10) of the distribution $F_{A,n,\theta_n}$ goes to 1; furthermore, the
location of the atomic part, i.e., $-n^{1/2}\theta_n$, then converges to $-\nu \in \mathbb{R}$ or to $\pm\infty$.
This proves the theorem in case $|\zeta| < 1$. We prove the remaining cases by
inspecting the limit behavior of (9), again with $\theta_n$ replacing $\theta$. To derive the
limits for $1 \leq |\zeta| \leq \infty$, note that $n^{1/2}\theta_n \to \operatorname{sign}(\zeta)\infty$, so that by assessing
the limit of the indicator functions in (9), it can easily be seen that $F_{A,n,\theta_n}(x)$
converges to the limit of $\Phi(z^{(2)}_{n,\theta_n}(x))$ for $\zeta > 0$ and to the limit of $\Phi(z^{(1)}_{n,\theta_n}(x))$
for $\zeta < 0$. Elementary calculations show that $z^{(2)}_{n,\theta_n}(x) \to \infty$ for $1 \leq \zeta < \infty$ and
that $z^{(1)}_{n,\theta_n}(x) \to -\infty$ for $-\infty < \zeta \leq -1$. As a consequence of Proposition 15,
also $z^{(2)}_{n,\theta_n}(x) \to \infty$ if $\zeta = \infty$ and $n^{1/2}\mu_n^2/\theta_n \to \infty$; similarly, $z^{(1)}_{n,\theta_n}(x) \to -\infty$ if
$\zeta = -\infty$ and $n^{1/2}\mu_n^2/\theta_n \to -\infty$. This then proves the remaining cases in part
2. Under the assumptions of part 3, an application of Proposition 15 gives that
$z^{(2)}_{n,\theta_n}(x) \to x + r$ if $\zeta = \infty$ and that $z^{(1)}_{n,\theta_n}(x) \to x + r$ if $\zeta = -\infty$, which then
proves part 3. ■

Proof of Theorem 6. To prove part 1, observe that Proposition 1 implies
$\lim_{n\to\infty} P_{n,\theta_n}(\hat\theta_A = 0) = 1$ for $|\zeta| < 1$. This entails

$$\lim_{n\to\infty} P_{n,\theta_n}\left(\mu_n^{-1}(\hat\theta_A - \theta_n) \leq x\right) = \lim_{n\to\infty} P_{n,\theta_n}\left(\mu_n^{-1}(\hat\theta_A - \theta_n) \leq x,\ \hat\theta_A = 0\right)$$
$$= \lim_{n\to\infty} 1(-\theta_n/\mu_n \leq x) = 1(x \geq -\zeta)$$

for $x \neq -\zeta$, which establishes part 1. Next, observe that

$$G_{A,n,\theta_n}(x) = 1(\theta_n/\mu_n + x \geq 0)\,\Phi(w^{(2)}_{n,\theta_n}(x)) + 1(\theta_n/\mu_n + x < 0)\,\Phi(w^{(1)}_{n,\theta_n}(x)) \qquad (16)$$

where $w^{(1)}_{n,\theta_n}(x)$ and $w^{(2)}_{n,\theta_n}(x)$ with $w^{(1)}_{n,\theta_n}(x) \leq w^{(2)}_{n,\theta_n}(x)$ are given by

$$n^{1/2}\mu_n\left[(-\theta_n/\mu_n + x) \pm \sqrt{(\theta_n/\mu_n + x)^2 + 4}\right]/2. \qquad (17)$$

Under the conditions of part 2, the first indicator function in (16) tends to
1 for $x > -\zeta$ and to 0 for $x < -\zeta$. Consequently, $G_{A,n,\theta_n}(x)$ converges
to $\lim_{n\to\infty} \Phi(w^{(2)}_{n,\theta_n}(x))$ if $x > -\zeta$ and to $\lim_{n\to\infty} \Phi(w^{(1)}_{n,\theta_n}(x))$ if $x < -\zeta$
(provided the limits exist). Elementary calculations show that for $\zeta \geq 1$ we
have $w^{(1)}_{n,\theta_n}(x) \to -\infty$ for all $x \in \mathbb{R}$, $w^{(2)}_{n,\theta_n}(x) \to -\infty$ for $x < -1/\zeta$, and
$w^{(2)}_{n,\theta_n}(x) \to \infty$ for $x > -1/\zeta$. For $\zeta \leq -1$ we obtain $w^{(1)}_{n,\theta_n}(x) \to -\infty$ for
$x < -1/\zeta$, $w^{(1)}_{n,\theta_n}(x) \to \infty$ for $x > -1/\zeta$, and $w^{(2)}_{n,\theta_n}(x) \to \infty$ for all $x \in \mathbb{R}$.
Consequently, for $x \neq -\zeta$, we find $G_{A,n,\theta_n}(x) \to 0$ for $x < -1/\zeta$ and $G_{A,n,\theta_n}(x) \to 1$
for $x > -1/\zeta$. If $|\zeta| = 1$, the result in part 2 follows. If $|\zeta| > 1$, convergence
of $G_{A,n,\theta_n}(-\zeta)$ to the proper limit follows from monotonicity of $G_{A,n,\theta_n}$ and
the fact that $x = -\zeta$ is a continuity point of the limit distribution. This then
completes the proof of part 2.

For part 3 we consider first the case $\zeta = \infty$. Clearly, $G_{A,n,\theta_n}(x)$ converges
to $\lim_n \Phi(w^{(2)}_{n,\theta_n}(x))$. Since

$$w^{(2)}_{n,\theta_n}(x) = n^{1/2}\mu_n\left\{(-\theta_n/\mu_n + x) + \sqrt{(\theta_n/\mu_n + x)^2 + 4}\right\}/2$$

by (17), and because $\theta_n/\mu_n \to \infty$, it is easy to see that $w^{(2)}_{n,\theta_n}(x)$ converges to
$\infty$ if $x > 0$ and to $-\infty$ if $x < 0$. The case where $\zeta = -\infty$ is proved analogously.
■

Proof of Theorem 12. Let $\theta_n(\delta)$ be short-hand for $-(t + \delta)/n^{1/2}$. Elementary
calculations show that

$$\lim_{\delta \downarrow 0} \left| F_{A,n,\theta_n(-\delta)}(t) - F_{A,n,\theta_n(\delta)}(t) \right| = \Phi(t + n^{1/2}\mu_n) - \Phi(t - n^{1/2}\mu_n). \qquad (18)$$

In particular, this implies that the supremum of $|F_{A,n,\theta_n(-\delta)}(t) - F_{A,n,\theta_n(\delta)}(t)|$
over $0 < \delta < c - |t|$ is bounded from below by $\Phi(t + n^{1/2}\mu_n) - \Phi(t - n^{1/2}\mu_n)$.
The rest of the argument then proceeds similarly to the proof of Theorem 13
in Pötscher & Leeb (2007). ■
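
The jump described by (18) can be checked numerically. The sketch below assumes the closed form of the finite-sample cdf in the orthogonal case, namely $F_{A,n,\theta}(x) = \Phi(z^{(2)}_{n,\theta}(x))$ if $n^{1/2}\theta + x \geq 0$ and $\Phi(z^{(1)}_{n,\theta}(x))$ otherwise, with $z^{(1)} \leq z^{(2)}$ given by $-(n^{1/2}\theta - x)/2 \pm \sqrt{((n^{1/2}\theta + x)/2)^2 + n\mu_n^2}$; this formula is reconstructed from the expressions used in the proofs above and should be read as an assumption, as should the function names and the values of $n$, $t$, and $\delta$.

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def F(x, n, mu, theta):
    """cdf of n^{1/2}(theta_hat_A - theta) at x (assumed closed form)."""
    r = math.sqrt(n)
    a = (r * theta + x) / 2.0
    root = math.sqrt(a * a + n * mu * mu)
    base = -(r * theta - x) / 2.0
    return Phi(base + root) if r * theta + x >= 0 else Phi(base - root)

def theta_n(delta, t, n):
    """Short-hand theta_n(delta) = -(t + delta) / n^{1/2}."""
    return -(t + delta) / math.sqrt(n)

n, t, delta = 100, 0.3, 1e-9
mu = n ** (-1.0 / 3.0)
jump = abs(F(t, n, mu, theta_n(-delta, t, n)) - F(t, n, mu, theta_n(delta, t, n)))
target = Phi(t + math.sqrt(n) * mu) - Phi(t - math.sqrt(n) * mu)
print(jump, target)
```

Two parameter values that differ by only $2\delta/n^{1/2}$ thus produce cdf values at $t$ that differ by almost the full amount on the right-hand side of (18), which is what makes uniformly consistent estimation impossible.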



Proof of Theorem 13. Analogous to the proof of Theorem 14 in Pötscher & Leeb
(2007) except for using (18) in place of (11) in Pötscher & Leeb (2007). ■

Proof of Theorem 14. Analogous to the proof of Theorem 18 in Pötscher & Leeb
(2007). ■